Variational approach to protein design and extraction of interaction potentials 
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We present and discuss a novel approach to the direct and inverse protein folding problem. The 
proposed strategy is based on a variational approach that allows the simultaneous extraction of 
amino acid interactions and the low-temperature free energy of sequences of amino acids. The 
knowledge-based technique is simple and straightforward to implement even for realistic off-lattice 
proteins because it does not entail threading-like procedures. Its validity is assessed in the context 
of a lattice model by means of a variety of stringent checks. 



Two long standing challenges in molecular biology are 
the direct and inverse problems of protein foldingll. The 
first deals with the determination of the thermodynami- 
cally jSjtable conformation of a known sequence of amino 
acidsBB while the second involves the elucidation of the 
amino acid sequence (if any) which adiuits a given target 
structure as its stable conformatiorDcl. One route to a 
solution of the direct problem requires knowledge of the 
interaction potentials between the protein constituents 

- in principle one studies the energies of the sequence 
in various conformations and identifies the native [State 
structure as being the one with the lowest energya. A 
solution of the inverse or the sequence design problenn 
entails knowledge of the free energies of the sequencesO. 
This follows from an application of Boltzmann statistics 

- what matters is not how low the energy of a sequence 
is in the target conformation (a measure of which can 
be obtained from the knowledge of the interaction po- 
tentials) but whether this energy is lower than those in 
alternative conformations. 

In this report, we introduce a variational approach for 
extracting the interaction potentials between the protein 
constituents and the free energies of candidate sequences 
simultaneously. The method is general and applicable 
to real proteins. The input is a set of sequences and 
their respective native structures, as available from the 
Protein Data Bank. A feature of the technique is that 
alternative conformations that compete with the native 
state in housing a given sequence are not required. Here 
we apply this method to a lattice model of proteins to 
provide a stringent validation of the approach. Unlike 
real proteins, the interaction potentials in such a model 
system can be chosen and one may measure the accuracy 
with which these potentials are determined as well as the 
effectiveness of sequence design. 

We adopt the approach of treating proteins at a meso- 
scopic level with the amino acids being the fundamental 
units. The infiuence of the internal degrees of freedom 
associated with each amino acid as well as the solvent 
degrees of freedom are incorporated through a coarse- 



grained Hamiltonian with effective interactions between 
the amino acids. 

The free energy, F(S), of a sequence, S, is defined from 
the equation 



g-/3F(S) ^ ^ g-/5-H(S,r) 



(1) 



where T-L{S,T) is the energy of S mounted on a confor- 
mation r and the sum is taken over all conformations 
that the sequence can adopt. A rigorous solutionB of 
the design problem on a target structure F entails the 
identification of the sequence(s), S, that maximizes the 
functional 



Pt{S) 



(2) 



evaluated at low temperature ( below the folding tran- 
sition temperature where ^'r(<5') — 1/2). Pr{S) is the 
probability that a sequence S is found in conformation T 
at an inverse temperature (3. Thus the solution S is the 
sequence which has the highest low-temperature P^b- 
ability of being found on F. At low temperatureald, a 
sequence S with a unique ground state, F, satisfies the 
inequality 



H{S, F) - F{S) < His, F) - F{S), 



(3) 



for arbitrary sequence, S, with the equality possibly hold- 
ing only when S admits F as its native state. A range of 
such equalities could be used to determine optimal values 
of variational parameters characterizing the interactions 
and the low temperature free energies. 

The maximization of Pr{S) is computationally de- 
manding because it involves the calculation of F{S) for 
each amino acid sequence and an exact calculation of 
^"(5*) for a given sequence S involves a sum over the 
enormous number of its possible conformations. The use 
of importance sampling techniques for the estimation of 
F{S) at low-T requires efficient algorithms to find con- 
formations that compete significantly with Fd. Such an 
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approach has been used fruitfully for lattice models of 
proteinsa, but is not feasible for realistic off-lattice casesB. 

F{S) formally depends only on S and hence one may 
postulate a functional form of F which depends on 
sequence properties- (e.g. the concentration of amino 
acids)ElEl. At T = 013, the free energy of a sequence ought 
to be exactly equal to its energy in the native state con- 
formation (which depends on the conformation and the 
interaction potentials) - this forms the basis for our vari- 
ational approach. Unlike the inequalities (^), the new 
approach does not entail the mounting of a sequence on 
any but its own native state conformation. We define 
an intensive functional A (whose choice is not unique), 
whose minimization can be used to identify a consistent 
set of potential and free energy parameters. A convenient 
choice that we used in our calculations is 



.f n{S,,T,)~F{S,) 



e[F(5,)-H(5„r,)] , (4) 



( ■H{S^,Ti)-F{S.,) 



where Q [x\ is the Heavside function, and the sum is taken 
over the sequence-native state conformation set in the 
protein data bank, and Li is the length of the ith se- 
quence. The second term in (^) is used to penalize cases 
for which the parameters violate the physical constraint, 
H{Si,ri) > F{Si). The quantity e is the absolute value of 
the average of the interaction strengths between amino 
acids and its utility is explained below. A zero value 
for A would correspond to a perfect parametrization of 
both the interaction potentials and the free energies for 
the finite set of sequences in the data bank. More gen- 
erally, for a finite protein data bank, there will exist a 
non-zero region in the parameter space of potentials and 
free energies within which A is at a minimum. With 
perfect parametrization, this region would be expected 
to shrink arovaid the parameter values as the data bank 
size increasesllil. It should be stressed that, contrary to 
common potential extraction or design procedures, the 
minimization of the functional (^ does not involve the 
use of decoy structures nor the mounting of sequence i on 
any structure other than its ground state, F^. In order to 
create a data bank, a random exploration of the ensemble 
of 4-amino-acid sequences of length 16 was performed to 
select those admitting a unique ground state conforma- 
tion. The possible protein conformations were assumed 
to be aeJf-avoiding oriented walks embedded on a square 
latticetj with an interaction between amino acids i and 
j only if they are next to each other on the lattice and 
yet not next to each other along the sequence. 

We chose an interaction matrix, e, between the 4 dif- 
ferent types (or classes) of amino acids. These are the 
entries of the 4x4 e-matrix in the first row of Table || 
(with ei,i = -40). 

To mimick the thermodynamic stability of proteins, we 
further selected the sequences and retained only those 



with an energy gap between the unique native state 
and first excited state energies > 10, a constraint sat- 
isfied by, roughly, 1% of the sequences. Our final data 
bank consisted of 500 sequences with their ground state, 
{•Si, Filial. ..500. 

In our model studies, we chose to parametrize the in- 
teraction matrix with the same functional form as the 
true interaction matrix but with nine variational param- 
eters in the symmetric e matrix (ei^i was held fixed at a 
value of -40 in order to set the energy scale) . We assumed 
the simplest form for an extensive free energyoEl with four 
variational parameters (denoted by a^, i = 1...4): 



F{S) — ai ■ ni + a2 ■ n2 + ■ 713 + a4 ■ n4 , 



(5) 



where Ui is the number of amino acids of type i found in 
S. Eq. (^) may be viewed as the lowest order expansion 
of F in the "order parameters", n^'s. 

A was minimized using a simulated annealing proce- 
dure with the constraint ei^i < e^^ < 0. The hierarchy 
of amino acids interaction strengths was mirrored by the 
frequencies of pair contacts in the data bank and allowed 
for a restriction of the search in parameter space. 

The quantity e in (|^) was useful in avoiding conver- 
gence to a spurious trivial solution in which all the Cij 's 
are equal to ei,i = —40, and F{S) becomes (-40) times 
the number of contacts. 

The functional A was minimized using subsets of our 
global data bank within which the number of elements 
ranged from 100 to 500. The extracted potentials, as 
well as the free energy coefficients appear in Table |. We 
further checked, using the extracted parameters, whether 
each sequence in the data bank recognized the associated 
structure as its ground state among all the possible con- 
formations. The success rate was typically > 80% with 
an increase in the success rate on increasing the size of 
the data bank. 

We then proceeded to use the functional {H — F) to 
carry out sequence design on a target structure. This en- 
tails the identification from among the 4^^ sequences the 
one that minimizes {H — F) (using the extracted param- 
eters) on the target structure F. The correctness of the 
design is checked by using the true Hamiltonian to verify 
whether the designed sequence admits F as its (possibly 
degenerate) ground state. Our test was performed on 
100 structures taken from our data bank using a Monte 
Carlo procedure. 

Fig. |l| shows a plot of the design success rate as a func- 
tion of the size of the training set. It is worth noting that 
none of the designed sequences appeared in the original 
data bank. Our analysis was not limited to those se- 
quences with the lowest {H — F) score; we extended it to 
the 10 highest ranking sequences for each target struc- 
ture. Using the parameters deduced from the training 
set of size 500, we found an excellent overall design suc- 
cess of 88% and 92% for unique and degenerate encoding 
respectively. 

In Fig. 1^ we have plotted the histogram of the (H — F) 
distribution for the improperly chosen sequences (black) 
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and the correct ones (gray) . The {H — F) score for the 
improperly chosen sequences take on large positive val- 
ues, signalling that the estimated energy, F of the se- 
quence in its unknown native state is substantially lower 
than in the target structure. Thus one may discard a 
priori the majority of bad solutions by a mere inspection 
of their large {H — F) scores. The unphysical negative 
values of {H — F) originate from the small size of the 
training set and the imperfect parametrization of the free 
energy. We also considered several generalizations of (|^) 
including two-body terms of the form rii ■ nj and chemical 
potentials that control the numbep, of "walls" separating 
segments of identical amino acidsa with slight improve- 
ment in the success rates. A further check of the quality 
of the designed sequences was performed by inspecting 
the distribution of their energy gaps versus those used in 
the data bank. The designed sequences tend to have en- 
ergy gaps between the native state and the first excited 
state that are larger than those of sequences in the data 
bank (Figure ||) showing that the design procedure yields 
sequences with a higher thermodynamic stability. 

Finally, we performed a challenging blind test to assess 
the validity of the variational approach. The coefficients 
extracted for the 16-bead case were used to carry out se- 
quence design on a compact target conformation of length 
25. The target conformation was RRDLDRDDLULDL- 
LURULUURDRD (where R,L,U,D stand for right, left, 
up and down respectively and indicate the directions of 
the bonds that define the self-avoiding lattice conforma- 
tion) isihich is highly designableEj inj-a different lattice 
modelllJ and is geometrically regulaiO. The sequence 
chosen was 124211324211324211324211. Indeed, an ex- 
haustive search of the native state of the sequence among 
all self-avoiding walks of length 25 confirmed that this 
sequence had the target structure as its unique ground 
state. 

In order to ensure that the strategy used here is robust 
and independent of the particular choice of the e matrix 
and/or data bank, we performed a similar analysis us- 
ing another randomly generated interaction matrix and 
found results of statistically similar quality as summa- 
rized in the bottom of Table |[ 

Our results show that one may define a design score, 
A, that takes on small values for sequences mounted on 
their true native state and large positive values for im- 
proper mounting. It is striking that the simple free en- 
ergy form as in (|^) can be so effective for building a re- 
liable A functional. A physically appealing explanation 
for this is to regard the parameters in (^) as controlling 
both the residue composition of the designed sequence as 
well as indicating its expected ground state energy. The 
solution to a design problem will be provided by the se- 
quence(s) that meets the composition requirements and 
which, when mounted on the target structure, has an en- 
ergy equal to or better than the expected value. Thus, 
the variational approach provides a feedback mechanism 
for design; it is self-regulating in that no external action 
is required to rule out runaway solutions favouring the 



abundance of the most energetically favoured contacts. 
This self-regulating mechanism also counterbalances an 
improper parametrization of H and/or F, thus decreas- 
ing the sensitivity of the overall {H — F) score to the 
detailed functional form of A. 

In conclusion, we have presented a novel procedure for 
tackling the direct and inverse folding problems simulta- 
neously. The proposed strategy is general and ought to 
be applicable to the case of real proteins. We have dis- 
cussed a practical implementation of the technique and 
have carried out rigorous testing of its efRciency in folding 
and design. The results are encouraging and are sugges- 
tive of the feasibility of a simple parametrization of the 
free energy of sequences of amino acids. 

This work was supported in part by INFN sez. di Tri- 
este, NASA, NATO and the Center for Academic Com- 
puting at Penn State. 
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FIG. 3. Distribution of the energy gaps between the na- 
tive state and first excited state energies for the sequences 
in the data bank (gray) and designed sequences that have a 
non-degenerate ground state (black). 



FIG. 1. Plot of the success rate in identifying the sequence 
that admits a pre-assigned target structure as its degenerate 
(squares) and non-degenerate (circles) ground state as a func- 
tion of the training set size. The results were obtained with 
a single randomly chosen set of each size. 
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FIG. 2. Distribution of the quantity [H — F) for the cor- 
rectly chosen sequences (gray) and the improper sequences 
(black). 
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PDB Size 


ei,2 


£1,3 


£1,4 


£2,2 


£2,3 


£2,4 


£3,3 


£3,4 


£4,4 




ffl2 


ffl3 


04 


TRUE 


-30 


-20 


-17 


-25 


-13 


-10 


-5 


-2 


-1 










100 


-32.63 


-26.80 


-22.71 


-31.57 


-22.71 


-17.76 


-17.76 


-17.75 


-0.00 


-26.50 


-17.44 


-12.60 


-10.39 


200 


-32.62 


-25.56 


-23.17 


-30.65 


-23.17 


-16.06 


-14.07 


-5.43 


-0.00 


-26.15 


-17.14 


-12.33 


-11.07 


300 


-31.27 


-27.91 


-23.29 


-27.91 


-23.17 


-12.61 


-8.50 


-8.50 


-0.00 


-27.17 


-15.62 


-11.83 


-10.87 


400 


-31.03 


-27.14 


-23.27 


-27.14 


-22.44 


-12.50 


-6.94 


-6.77 


-0.00 


-27.62 


-15.25 


-11.52 


-10.38 


500 


-32.23 


-25.93 


-24.12 


-28.55 


-22.78 


-16.40 


-11.79 


-9.13 


-5.21 


-25.63 


-16.49 


-10.53 


-9.55 


TRUE 


-22 


-18 


-12 


-11 


-17 


-1 


-28 


-13 


-1 










250 


-24.05 


-17.95 


-13.22 


-13.02 


-17.95 


-0.00 


-24.06 


-13.22 


-0.00 


-26.26 


-10.41 


-12.70 


-4.14 



TABLE I. A summary of the results with two data banks containing 500 and 250 training proteins respectively. In all cases 
€1,1 was fixed at -40 in order to set the energy scale. The row entitled TRUE shows the true potential parameters in both 
cases. The other rows show the values of the extracted parameters of the potential and the free energy with the number in the 
first column showing the number of proteins in the training set. A single randomly chosen set was employed in each case. For 
the second data bank, the folding success rate was 91 %, while the unique and degenerate design success rates were 73% and 
96% respectively. 
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